Online Markov Decision Processes Under Bandit Feedback

نویسندگان
چکیده

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Online Markov Decision Processes

We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which ha...

متن کامل

Online Stochastic Optimization under Correlated Bandit Feedback

In this paper we consider the problem of online stochastic optimization of a locally smooth function under bandit feedback. We introduce the high-confidence tree (HCT) algorithm, a novel any-time X -armed bandit algorithm, and derive regret bounds matching the performance of existing state-of-the-art in terms of dependency on number of steps and smoothness factor. The main advantage of HCT is t...

متن کامل

A Multi-Armed Bandit Approach for Online Expert Selection in Markov Decision Processes

We formulate a multi-armed bandit (MAB) approach to choosing expert policies online in Markov decision processes (MDPs). Given a set of expert policies trained on a state and action space, the goal is to maximize the cumulative reward of our agent. The hope is to quickly find the best expert in our set. The MAB formulation allows us to quantify the performance of an algorithm in terms of the re...

متن کامل

PAC Bounds for Multi-armed Bandit and Markov Decision Processes

The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that given n arms, it suffices to pull the arms O( n 2 log 1 δ ) times to find an -optimal arm with probability of at least 1 − δ. This is in contrast to the naive bound of O( n 2 log n δ ). We derive another algorithm whose complexity depends on the specific setting of the rewards,...

متن کامل

Online Markov decision processes with policy iteration

The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ژورنال

عنوان ژورنال: IEEE Transactions on Automatic Control

سال: 2014

ISSN: 0018-9286,1558-2523

DOI: 10.1109/tac.2013.2292137